Red Wines Exploratory Analysis by Saqib Ali

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## [1] 1599   13
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

We can see that quality is an integer rating, which can be converted into a factor.
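The conversion can be sketched as follows. This is only an illustration: the ten scores are the first ten values printed by str() above, and the object names `quality` and `quality_f` are assumptions, not the names used in the original analysis.

```r
# First ten quality scores, copied from the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Declare all six ratings seen in the full data (3 to 8) as levels, so that
# ratings missing from this small sample are still represented.
quality_f <- factor(quality, levels = 3:8)

str(quality_f)    # integer codes index the levels: 5 -> 3, 6 -> 4, 7 -> 5
table(quality_f)  # counts per level, including zero counts
```

On the full data frame the same idea would presumably be applied to the `quality` column of `rwines`.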

##  Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol      quality
##  Min.   : 8.40   3: 10  
##  1st Qu.: 9.50   4: 53  
##  Median :10.20   5:681  
##  Mean   :10.42   6:638  
##  3rd Qu.:11.10   7:199  
##  Max.   :14.90   8: 18

We have 1599 observations. Looking at the summary, we can see that for some variables the maximum value is much higher than the third quartile, which hints at outliers, e.g. in fixed.acidity and volatile.acidity. We should check for outliers in our analysis.
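One common way to make this hint concrete is the 1.5 * IQR rule, which flags values above Q3 + 1.5 * (Q3 - Q1). The sketch below applies it to the quartiles printed in the summary above. The rule itself and the helper name `upper_fence` are assumptions chosen for illustration; the original analysis does not say which outlier criterion it used.

```r
# Upper fence of the 1.5 * IQR rule, computed from two quartiles.
# `upper_fence` is a hypothetical helper, not part of the original analysis.
upper_fence <- function(q1, q3) q3 + 1.5 * (q3 - q1)

# Quartiles and maxima copied from the summary() output above.
fence_fixed    <- upper_fence(q1 = 7.10, q3 = 9.20)  # fixed.acidity
fence_volatile <- upper_fence(q1 = 0.39, q3 = 0.64)  # volatile.acidity

15.90 > fence_fixed     # is the maximum of fixed.acidity beyond the fence?
1.58  > fence_volatile  # is the maximum of volatile.acidity beyond the fence?
```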

Let us also summarise the data by quality level, since quality is the main feature of interest.
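A grouped summary of this kind can be produced with base R's by(). The sketch below uses a small toy data frame built from the first ten rows shown in the str() output, so the name `toy` and the choice of columns are assumptions; in the report the data frame is called `rwines`.

```r
# Toy sample: first ten rows of three columns, copied from the str() output.
toy <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality          = factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5))
)

# Apply summary() to the rows of each quality level separately.
res <- by(toy, toy$quality, summary)
res
```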

## rwines$quality: 3
##        X          fixed.acidity    volatile.acidity  citric.acid    
##  Min.   : 460.0   Min.   : 6.700   Min.   :0.4400   Min.   :0.0000  
##  1st Qu.: 726.5   1st Qu.: 7.150   1st Qu.:0.6475   1st Qu.:0.0050  
##  Median :1100.0   Median : 7.500   Median :0.8450   Median :0.0350  
##  Mean   :1053.2   Mean   : 8.360   Mean   :0.8845   Mean   :0.1710  
##  3rd Qu.:1446.2   3rd Qu.: 9.875   3rd Qu.:1.0100   3rd Qu.:0.3275  
##  Max.   :1506.0   Max.   :11.600   Max.   :1.5800   Max.   :0.6600  
##  residual.sugar    chlorides      free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :1.200   Min.   :0.0610   Min.   : 3.0        Min.   : 9.0        
##  1st Qu.:1.875   1st Qu.:0.0790   1st Qu.: 5.0        1st Qu.:12.5        
##  Median :2.100   Median :0.0905   Median : 6.0        Median :15.0        
##  Mean   :2.635   Mean   :0.1225   Mean   :11.0        Mean   :24.9        
##  3rd Qu.:3.100   3rd Qu.:0.1430   3rd Qu.:14.5        3rd Qu.:42.5        
##  Max.   :5.700   Max.   :0.2670   Max.   :34.0        Max.   :49.0        
##     density             pH          sulphates         alcohol      
##  Min.   :0.9947   Min.   :3.160   Min.   :0.4000   Min.   : 8.400  
##  1st Qu.:0.9961   1st Qu.:3.312   1st Qu.:0.5125   1st Qu.: 9.725  
##  Median :0.9976   Median :3.390   Median :0.5450   Median : 9.925  
##  Mean   :0.9975   Mean   :3.398   Mean   :0.5700   Mean   : 9.955  
##  3rd Qu.:0.9988   3rd Qu.:3.495   3rd Qu.:0.6150   3rd Qu.:10.575  
##  Max.   :1.0008   Max.   :3.630   Max.   :0.8600   Max.   :11.000  
##  quality
##  3:10   
##  4: 0   
##  5: 0   
##  6: 0   
##  7: 0   
##  8: 0   
## -------------------------------------------------------- 
## rwines$quality: 4
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :  19   Min.   : 4.600   Min.   :0.230    Min.   :0.0000  
##  1st Qu.: 262   1st Qu.: 6.800   1st Qu.:0.530    1st Qu.:0.0300  
##  Median : 831   Median : 7.500   Median :0.670    Median :0.0900  
##  Mean   : 797   Mean   : 7.779   Mean   :0.694    Mean   :0.1742  
##  3rd Qu.:1262   3rd Qu.: 8.400   3rd Qu.:0.870    3rd Qu.:0.2700  
##  Max.   :1522   Max.   :12.500   Max.   :1.130    Max.   :1.0000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 1.300   Min.   :0.04500   Min.   : 3.00      
##  1st Qu.: 1.900   1st Qu.:0.06700   1st Qu.: 6.00      
##  Median : 2.100   Median :0.08000   Median :11.00      
##  Mean   : 2.694   Mean   :0.09068   Mean   :12.26      
##  3rd Qu.: 2.800   3rd Qu.:0.08900   3rd Qu.:15.00      
##  Max.   :12.900   Max.   :0.61000   Max.   :41.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  7.00       Min.   :0.9934   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 14.00       1st Qu.:0.9957   1st Qu.:3.300   1st Qu.:0.4900  
##  Median : 26.00       Median :0.9965   Median :3.370   Median :0.5600  
##  Mean   : 36.25       Mean   :0.9965   Mean   :3.382   Mean   :0.5964  
##  3rd Qu.: 49.00       3rd Qu.:0.9974   3rd Qu.:3.500   3rd Qu.:0.6000  
##  Max.   :119.00       Max.   :1.0010   Max.   :3.900   Max.   :2.0000  
##     alcohol      quality
##  Min.   : 9.00   3: 0   
##  1st Qu.: 9.60   4:53   
##  Median :10.00   5: 0   
##  Mean   :10.27   6: 0   
##  3rd Qu.:11.00   7: 0   
##  Max.   :13.10   8: 0   
## -------------------------------------------------------- 
## rwines$quality: 5
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 5.000   Min.   :0.180    Min.   :0.0000  
##  1st Qu.: 298   1st Qu.: 7.100   1st Qu.:0.460    1st Qu.:0.0900  
##  Median : 713   Median : 7.800   Median :0.580    Median :0.2300  
##  Mean   : 742   Mean   : 8.167   Mean   :0.577    Mean   :0.2437  
##  3rd Qu.:1189   3rd Qu.: 8.900   3rd Qu.:0.670    3rd Qu.:0.3600  
##  Max.   :1598   Max.   :15.900   Max.   :1.330    Max.   :0.7900  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 1.200   Min.   :0.03900   Min.   : 3.00      
##  1st Qu.: 1.900   1st Qu.:0.07400   1st Qu.: 9.00      
##  Median : 2.200   Median :0.08100   Median :15.00      
##  Mean   : 2.529   Mean   :0.09274   Mean   :16.98      
##  3rd Qu.: 2.600   3rd Qu.:0.09400   3rd Qu.:23.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :68.00      
##  total.sulfur.dioxide    density             pH          sulphates    
##  Min.   :  6.00       Min.   :0.9926   Min.   :2.880   Min.   :0.370  
##  1st Qu.: 26.00       1st Qu.:0.9962   1st Qu.:3.200   1st Qu.:0.530  
##  Median : 47.00       Median :0.9970   Median :3.300   Median :0.580  
##  Mean   : 56.51       Mean   :0.9971   Mean   :3.305   Mean   :0.621  
##  3rd Qu.: 84.00       3rd Qu.:0.9979   3rd Qu.:3.400   3rd Qu.:0.660  
##  Max.   :155.00       Max.   :1.0031   Max.   :3.740   Max.   :1.980  
##     alcohol     quality
##  Min.   : 8.5   3:  0  
##  1st Qu.: 9.4   4:  0  
##  Median : 9.7   5:681  
##  Mean   : 9.9   6:  0  
##  3rd Qu.:10.2   7:  0  
##  Max.   :14.9   8:  0  
## -------------------------------------------------------- 
## rwines$quality: 6
##        X          fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   4.0   Min.   : 4.700   Min.   :0.1600   Min.   :0.0000  
##  1st Qu.: 443.0   1st Qu.: 7.000   1st Qu.:0.3800   1st Qu.:0.0900  
##  Median : 882.5   Median : 7.900   Median :0.4900   Median :0.2600  
##  Mean   : 847.4   Mean   : 8.347   Mean   :0.4975   Mean   :0.2738  
##  3rd Qu.:1224.8   3rd Qu.: 9.400   3rd Qu.:0.6000   3rd Qu.:0.4300  
##  Max.   :1599.0   Max.   :14.300   Max.   :1.0400   Max.   :0.7800  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.03400   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.06825   1st Qu.: 8.00      
##  Median : 2.200   Median :0.07800   Median :14.00      
##  Mean   : 2.477   Mean   :0.08496   Mean   :15.71      
##  3rd Qu.: 2.500   3rd Qu.:0.08800   3rd Qu.:21.00      
##  Max.   :15.400   Max.   :0.41500   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.860   Min.   :0.4000  
##  1st Qu.: 23.00       1st Qu.:0.9954   1st Qu.:3.220   1st Qu.:0.5800  
##  Median : 35.00       Median :0.9966   Median :3.320   Median :0.6400  
##  Mean   : 40.87       Mean   :0.9966   Mean   :3.318   Mean   :0.6753  
##  3rd Qu.: 54.00       3rd Qu.:0.9979   3rd Qu.:3.410   3rd Qu.:0.7500  
##  Max.   :165.00       Max.   :1.0037   Max.   :4.010   Max.   :1.9500  
##     alcohol      quality
##  Min.   : 8.40   3:  0  
##  1st Qu.: 9.80   4:  0  
##  Median :10.50   5:  0  
##  Mean   :10.63   6:638  
##  3rd Qu.:11.30   7:  0  
##  Max.   :14.00   8:  0  
## -------------------------------------------------------- 
## rwines$quality: 7
##        X          fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   8.0   Min.   : 4.900   Min.   :0.1200   Min.   :0.0000  
##  1st Qu.: 490.5   1st Qu.: 7.400   1st Qu.:0.3000   1st Qu.:0.3050  
##  Median : 941.0   Median : 8.800   Median :0.3700   Median :0.4000  
##  Mean   : 832.2   Mean   : 8.872   Mean   :0.4039   Mean   :0.3752  
##  3rd Qu.:1081.0   3rd Qu.:10.100   3rd Qu.:0.4850   3rd Qu.:0.4900  
##  Max.   :1585.0   Max.   :15.600   Max.   :0.9150   Max.   :0.7600  
##  residual.sugar    chlorides       free.sulfur.dioxide
##  Min.   :1.200   Min.   :0.01200   Min.   : 3.00      
##  1st Qu.:2.000   1st Qu.:0.06200   1st Qu.: 6.00      
##  Median :2.300   Median :0.07300   Median :11.00      
##  Mean   :2.721   Mean   :0.07659   Mean   :14.05      
##  3rd Qu.:2.750   3rd Qu.:0.08700   3rd Qu.:18.00      
##  Max.   :8.900   Max.   :0.35800   Max.   :54.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  7.00       Min.   :0.9906   Min.   :2.920   Min.   :0.3900  
##  1st Qu.: 17.50       1st Qu.:0.9948   1st Qu.:3.200   1st Qu.:0.6500  
##  Median : 27.00       Median :0.9958   Median :3.280   Median :0.7400  
##  Mean   : 35.02       Mean   :0.9961   Mean   :3.291   Mean   :0.7413  
##  3rd Qu.: 43.00       3rd Qu.:0.9974   3rd Qu.:3.380   3rd Qu.:0.8300  
##  Max.   :289.00       Max.   :1.0032   Max.   :3.780   Max.   :1.3600  
##     alcohol      quality
##  Min.   : 9.20   3:  0  
##  1st Qu.:10.80   4:  0  
##  Median :11.50   5:  0  
##  Mean   :11.47   6:  0  
##  3rd Qu.:12.10   7:199  
##  Max.   :14.00   8:  0  
## -------------------------------------------------------- 
## rwines$quality: 8
##        X          fixed.acidity    volatile.acidity  citric.acid    
##  Min.   : 268.0   Min.   : 5.000   Min.   :0.2600   Min.   :0.0300  
##  1st Qu.: 462.5   1st Qu.: 7.250   1st Qu.:0.3350   1st Qu.:0.3025  
##  Median : 709.0   Median : 8.250   Median :0.3700   Median :0.4200  
##  Mean   : 826.7   Mean   : 8.567   Mean   :0.4233   Mean   :0.3911  
##  3rd Qu.:1182.5   3rd Qu.:10.225   3rd Qu.:0.4725   3rd Qu.:0.5300  
##  Max.   :1550.0   Max.   :12.600   Max.   :0.8500   Max.   :0.7200  
##  residual.sugar    chlorides       free.sulfur.dioxide
##  Min.   :1.400   Min.   :0.04400   Min.   : 3.00      
##  1st Qu.:1.800   1st Qu.:0.06200   1st Qu.: 6.00      
##  Median :2.100   Median :0.07050   Median : 7.50      
##  Mean   :2.578   Mean   :0.06844   Mean   :13.28      
##  3rd Qu.:2.600   3rd Qu.:0.07550   3rd Qu.:16.50      
##  Max.   :6.400   Max.   :0.08600   Max.   :42.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :12.00        Min.   :0.9908   Min.   :2.880   Min.   :0.6300  
##  1st Qu.:16.00        1st Qu.:0.9942   1st Qu.:3.163   1st Qu.:0.6900  
##  Median :21.50        Median :0.9949   Median :3.230   Median :0.7400  
##  Mean   :33.44        Mean   :0.9952   Mean   :3.267   Mean   :0.7678  
##  3rd Qu.:43.00        3rd Qu.:0.9972   3rd Qu.:3.350   3rd Qu.:0.8200  
##  Max.   :88.00        Max.   :0.9988   Max.   :3.720   Max.   :1.1000  
##     alcohol      quality
##  Min.   : 9.80   3: 0   
##  1st Qu.:11.32   4: 0   
##  Median :12.15   5: 0   
##  Mean   :12.09   6: 0   
##  3rd Qu.:12.88   7: 0   
##  Max.   :14.00   8:18

Univariate Plots Section

We see that the majority of the wines have a medium quality of 5 or 6. Very few wines have a quality of 3, 4 or 8.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The fixed acidity looks roughly normally distributed. The median is 7.90 and the mean is 8.32, so the gap between them is small.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The volatile.acidity also looks roughly normally distributed, but there is a big difference between the third quartile (0.64) and the maximum value (1.58), which is an outlier.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The citric.acid plot doesn’t give us much idea about the shape. Let us try a log transform to see clearer peaks.

After the log transform, citric.acid looks closer to a normal distribution.
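One detail worth keeping in mind with this transform: the minimum of citric.acid is 0, and the logarithm of 0 is -Inf, so zero values fall off a log-scaled axis. The sketch below shows the effect on the first ten citric.acid values from the str() output. How the original plot handled zeros is not shown, so the two workarounds here are assumptions, not the author's method.

```r
# First ten citric.acid values, copied from the str() output above.
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

log_citric <- log10(citric)      # zeros become -Inf
sum(is.infinite(log_citric))     # how many values are lost this way

# Option 1 (assumption): transform only the strictly positive values.
log10(citric[citric > 0])

# Option 2 (assumption): use log(1 + x), which is defined at zero.
log1p(citric)
```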

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

The maximum value (289) is far above the third quartile (62). Large outliers pull the mean (46.47) well above the median (38).
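The sensitivity of the mean to a single large value can be seen on a small sample. The sketch below takes the first ten total.sulfur.dioxide values from the str() output and appends the reported maximum of 289; the variable names are assumptions used only for this illustration.

```r
# First ten total.sulfur.dioxide values, copied from the str() output above.
tsd <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Same sample with the reported maximum (289) appended as an extreme value.
tsd_out <- c(tsd, 289)

c(mean = mean(tsd),     median = median(tsd))
c(mean = mean(tsd_out), median = median(tsd_out))

# The mean shifts much more than the median when the extreme value is added.
mean(tsd_out) - mean(tsd)
median(tsd_out) - median(tsd)
```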

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

We can see that free.sulfur.dioxide and total.sulfur.dioxide are skewed to the right, but the log transform gives a more even distribution. We also have large outliers on the right side.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

We see that residual.sugar and chlorides are roughly normally distributed. The outliers, though, lie very far out on the right side.

Density has a very clean, normally distributed histogram.

Both density and pH are normally distributed too.

We see that alcohol is skewed to the right. After a log transform, it is still slightly skewed to the right.

Univariate Analysis

What is the structure of your dataset?

We have data with 1599 rows and 13 variables, where X is only an id. Quality is a factor and the rest of the variables are continuous.

What is/are the main feature(s) of interest in your dataset?

The main feature is quality, and we would like to see the effect of the other variables on the quality of wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

We have many variables which describe the chemical composition of the wine, and each one may contribute to its quality. In my opinion, if the quantity of a certain component is far below or above the optimum, it affects the quality of the wine a lot. We will review that in our analysis.

Did you create any new variables from existing variables in the dataset?

No. I will create new variables during the analysis as needed.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The distributions of some variables were skewed to the right. I used a log transform to get a better understanding of these distributions.

We also altered the x scale to focus on the area we are interested in and to avoid outliers.
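Restricting the x range can be done either by limiting the displayed axis or by dropping values above a cutoff. The sketch below shows both on the first ten residual.sugar values from the str() output. The 90th percentile cutoff and the variable names are assumptions; the limits actually used in the plots are not stated.

```r
# First ten residual.sugar values, copied from the str() output above.
sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Hypothetical cutoff: the 90th percentile of the sample.
upper <- unname(quantile(sugar, 0.90))

# Option 1: keep all data in the bins, but only display the range up to the cutoff.
hist(sugar, breaks = 20, xlim = c(0, upper),
     main = "residual.sugar (axis limited)", xlab = "residual.sugar")

# Option 2: drop values above the cutoff before plotting.
kept <- sugar[sugar <= upper]
hist(kept, main = "residual.sugar (values above cutoff dropped)",
     xlab = "residual.sugar")
```

The two options are not equivalent: the first only zooms the axis, while the second changes the data the bins are computed from.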

Bivariate Plots Section

We have seen in the description that higher levels of volatile acidity can make the taste of wine unpleasant. This hints at a link between quality and volatile.acidity.

We see that as the volatile.acidity decreases, the quality of wine increases, as expected.

We also see from the description document that “citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines”. Let us find out how citric.acid affects quality.

As is evident from the plots, an increase in citric acid goes along with an increase in the quality of wine. They are positively correlated.

It looks like the higher the alcohol content of a wine, the better the score it receives, but this effect only appears in wines with a quality of six or more; the remaining quality levels have similar median values.
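These trends can be cross-checked against the per-quality medians printed in the grouped summary earlier. The sketch below collects those medians into a data frame and looks at how they change from one quality level to the next; the data frame name `medians` is an assumption made for this illustration.

```r
# Medians per quality level, copied from the grouped summary output above.
medians <- data.frame(
  quality          = 3:8,
  volatile.acidity = c(0.845, 0.670, 0.580, 0.490, 0.370, 0.370),
  citric.acid      = c(0.035, 0.090, 0.230, 0.260, 0.400, 0.420),
  alcohol          = c(9.925, 10.000, 9.700, 10.500, 11.500, 12.150)
)

# Change in each median from one quality level to the next
# (rows: 3->4, 4->5, 5->6, 6->7, 7->8).
steps <- sapply(medians[-1], diff)
steps
```

In this table the volatile.acidity median never rises and the citric.acid median rises at every step, while the alcohol median only climbs steadily from quality 5 upwards.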

We see that sulphates are slightly positively correlated with quality. The relationship is weaker than for other variables, but it is there.

It seems that there is a negative relationship between density and quality.

Increased citric.acid should result in lower pH, and we can see here that the high quality wines, which have higher citric.acid, have lower pH. However, citric acid is a weak acid, so we do not know how precise these values are.

## 
##  Pearson's product-moment correlation
## 
## data:  rwines$volatile.acidity and rwines$citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5856550 -0.5174902
## sample estimates:
##        cor 
## -0.5524957

We see that there is a moderately strong negative correlation between volatile.acidity and citric.acid (-0.552).

We can also see from the scatter plot that volatile.acidity is negatively correlated (-0.552) with citric.acid.
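Output of this form comes from cor.test(), which computes a Pearson correlation by default. The sketch below runs it on a toy sample of the first ten rows from the str() output (the name `toy` is an assumption), and then checks that the reported t statistic follows from the reported correlation via t = r * sqrt(df) / sqrt(1 - r^2).

```r
# Toy sample: first ten rows of two columns, copied from the str() output.
toy <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
)

# Pearson correlation test on the toy sample (not the full data set).
res <- cor.test(toy$volatile.acidity, toy$citric.acid)
res$estimate    # sample correlation of the ten rows
res$parameter   # degrees of freedom, n - 2

# Consistency check using the values reported for the full data set.
r  <- -0.5524957
df <- 1597
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
t_stat          # should be close to the reported t = -26.489
```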

## 
##  Pearson's product-moment correlation
## 
## data:  rwines$pH and rwines$citric.acid
## t = -25.767, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5756337 -0.5063336
## sample estimates:
##        cor 
## -0.5419041

We also see that pH and citric acid are negatively correlated (-0.542). A higher concentration of acid results in more acidity, i.e. a lower pH value.

## 
##  Pearson's product-moment correlation
## 
## data:  rwines$density and rwines$alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5322547 -0.4583061
## sample estimates:
##        cor 
## -0.4961798

Similarly, we see a negative correlation (-0.496) between density and alcohol.
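The 95 percent confidence interval in this output is based on Fisher's z transform, and it can be reproduced from the reported correlation and sample size alone. The sketch below does this for density and alcohol; the variable names are chosen for illustration.

```r
# Values reported in the cor.test() output above.
r <- -0.4961798   # sample correlation between density and alcohol
n <- 1599         # number of observations (df + 2)

# Fisher z transform: atanh(r) is approximately normal with
# standard error 1 / sqrt(n - 3).
z  <- atanh(r)
se <- 1 / sqrt(n - 3)

# Build the interval on the z scale, then transform back with tanh().
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)
ci   # should be close to the reported (-0.5322547, -0.4583061)
```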

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

We observed that volatile.acidity, citric.acid and density have a direct effect on the quality of wine. Volatile.acidity and density had a negative relationship with quality, while citric.acid had a positive one.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

We observed that citric.acid and volatile.acidity are negatively related. This could be the reason that high quality wines have more citric acid and, as a result, a lower proportion of volatile acids and less volatility.

What was the strongest relationship you found?

We saw that citric.acid has the strongest relationship with quality, although we have not analysed all possible relationships yet.

We can see a clear effect: when citric acid increases, quality increases, which is consistent with citric acid adding freshness to the wine.

Multivariate Plots Section

## 
##  Pearson's product-moment correlation
## 
## data:  rwines$volatile.acidity and rwines$density
## t = 0.88044, df = 1597, p-value = 0.3788
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.02702409  0.07097074
## sample estimates:
##        cor 
## 0.02202623

We see from the scatter plot of volatile.acidity and density that there is no relationship between them (0.022).

## 
##  Pearson's product-moment correlation
## 
## data:  rwines$volatile.acidity and rwines$citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5856550 -0.5174902
## sample estimates:
##        cor 
## -0.5524957

As we found out earlier, there is a negative relationship between citric.acid and volatile.acidity. It is also evident from the scatter plot and the correlation coefficient (-0.552).

## 
##  Pearson's product-moment correlation
## 
## data:  rwines$citric.acid and rwines$density
## t = 15.665, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3216809 0.4066925
## sample estimates:
##       cor 
## 0.3649472

There is a weak positive relationship (0.365) between citric.acid and density.

Alcohol is less dense than water, so a high alcohol content should mean a low density. Let us find out the relationship between density and alcohol.

## 
##  Pearson's product-moment correlation
## 
## data:  rwines$alcohol and rwines$density
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5322547 -0.4583061
## sample estimates:
##        cor 
## -0.4961798

As expected, we find a negative relationship (-0.496).

As expected, the plot shows that high volatile.acidity and low citric.acid are usually an indicator of bad quality.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

As we have seen before, wines with higher citric.acid and lower volatile.acidity are of higher quality. We also saw that a high alcohol content results in lower density, which is a property of high quality wines.

We also found that high quality wines

Were there any interesting or surprising interactions between features?

All relationships are as expected.


Final Plots and Summary

Plot One

##   3   4   5   6   7   8 
##  10  53 681 638 199  18

Description One

A very high number of wines have medium quality (5 or 6). The wines of quality 3 (10), 4 (53) and 8 (18) together make up about 5% of the total observations. Around 12% are quality level 7 (199), while the remaining 82% or so are quality 5 (681) and 6 (638). We have no observations for quality 0, 1 and 2, and none for quality 9 and 10 either.
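The share of each quality level can be recomputed directly from the counts in the table above. The sketch below is a minimal check; the names `counts` and `share` are assumptions used only here.

```r
# Counts per quality level, copied from the table above.
counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Percentage of all observations at each quality level.
share <- 100 * counts / sum(counts)
round(share, 1)

# Combined shares of the groups discussed in the text.
round(sum(share[c("3", "4", "8")]), 1)  # rare levels 3, 4 and 8
round(sum(share[c("5", "6")]), 1)       # medium levels 5 and 6
```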

This imbalance might be a problem: because the ratings are subjective, we need more data for high quality wines to work out more accurately what makes a good wine, and likewise for low quality ones.

Plot Two

Description Two

The highest correlated () box plot. There is a very big difference between the lowest and highest quality wines. The difference between the lowest quality (3, 4) and (7, 8) is not as large as the difference from the medium (5, 6) quality wines.

We also see some overlap between the boxes, which means many low quality wines still have the same citric acid levels as the highest quality ones.

Plot Three

Description Three

The three most influential variables, quality, volatile acidity and citric acid, are shown in this scatter plot. Each quality level is shown in a different color. There is a negative correlation (-0.552) between volatile acidity and citric acid.

We observe that the lowest quality wines are towards the higher end of the plot. But we also observe that quality 8 is at a higher level than quality 7. This shows that there might be a correlation, but it is not perfectly linear.


Reflection

The data consists of 1599 observations. It contains 13 variables, one of which is only the id. We also learnt from the description that the data is sensory data, with the quality rating calculated by taking the median of the scores assigned by three or more experts on a scale of zero to ten. That is why we do not have values of zero or 10.

Because the data is subjective, we expected some difficulty in finding out which factors increase the quality rating of a wine. Although we couldn’t find very strong relationships, we still found some that can help to predict the quality of wine. For example, volatile acidity (acetic acid), citric acid, pH and alcohol content each tell us something about the quality of a wine.

The analysis could be improved in the following ways:

1. If we had the same number of observations for all quality levels, it would be easier to find relationships between the different variables.
2. It would also help if we had data for wines with quality 1 and 2 or 9 and 10. That would show clearer trends, as the variables might take more extreme values at the extreme quality levels.
3. We also need unbiased observational data that is not affected by the color, temperature or age of the subjects.